DICE is a general-purpose multidimensional numerical integration package. DICE can be parallelized in two
ways: by "distributing random numbers into workers" or by "distributing hypercubes into workers".
The two ways can also be combined. So far, we have developed parallel code
using the former way and reported it at ACAT2002 in Moscow. Here we present the recent developments of
parallelized DICE using the latter way, as the 2nd stage of our parallelization activities.



1. Introduction 

Recently it is not rare to calculate tree-level cross sections of
physics processes with more than 6 final-state particles. In such
calculations, singularities may appear close to the diagonal of the
integration region, and it is sometimes very difficult to find a good
set of variable transformations to remove them. For one-loop and
higher-loop physics processes, when the loop calculation is carried
out purely numerically, we need several multidimensional integration
packages, or another integration method, so that the numerical results
can be compared and checked. To meet this need, DICE was developed by
K. Tobimatsu and S. Kawabata. It is a general-purpose multidimensional
numerical integration package.

1.1. The non-parallelized version of DICE 

The first version of DICE [1] appeared in 1992 and is a scalar program
code. In DICE, the integration region is divided repeatedly into
2^Ndim hypercubes, where Ndim is the number of dimensions, according
to the division condition. To evaluate the integral and its variance
in each hypercube, DICE applies two kinds of sampling, regular
sampling and random sampling, as follows:
1. Apply regular sampling and evaluate the contribution.

Then check whether the division condition is satisfied.

2. Apply the 1st random sampling and evaluate the contribution.

Then check whether the division condition is satisfied.

3. Apply the 2nd random sampling and evaluate the contribution.
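
The three steps above can be sketched in Python as a recursive subdivision. This is only a minimal illustration under stated assumptions, not the DICE implementation: the text does not define the division condition or the regular sampling, so the sketch assumes that a hypercube is divided when its estimated error exceeds a tolerance, and that regular sampling uses the centres of the 2^Ndim sub-hypercubes. The names `integrate`, `estimate`, `split`, `tol` and `max_depth` are hypothetical.

```python
import itertools
import math
import random

def volume(lo, hi):
    return math.prod(b - a for a, b in zip(lo, hi))

def split(lo, hi):
    """Halve every axis: one hypercube becomes 2**ndim sub-hypercubes."""
    mid = [(a + b) / 2 for a, b in zip(lo, hi)]
    for corner in itertools.product((0, 1), repeat=len(lo)):
        sub_lo = [m if c else a for a, m, c in zip(lo, mid, corner)]
        sub_hi = [b if c else m for b, m, c in zip(hi, mid, corner)]
        yield sub_lo, sub_hi

def estimate(f, lo, hi, points):
    """Integral estimate and its variance from a list of sample points."""
    vals = [f(p) for p in points]
    n = len(vals)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    vol = volume(lo, hi)
    return vol * mean, vol ** 2 * var / n

def regular_points(lo, hi):
    """Assumed regular sampling: centres of the 2**ndim sub-hypercubes."""
    return [[(a + b) / 2 for a, b in zip(s_lo, s_hi)]
            for s_lo, s_hi in split(lo, hi)]

def random_points(lo, hi, n, rng):
    return [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in range(n)]

def integrate(f, lo, hi, tol, rng, n=100, depth=0, max_depth=3):
    """Return (integral, variance) over the hypercube [lo, hi]."""
    samplers = (
        lambda: regular_points(lo, hi),         # step 1: regular sampling
        lambda: random_points(lo, hi, n, rng),  # step 2: 1st random sampling
    )
    for sampler in samplers:
        value, var = estimate(f, lo, hi, sampler())
        # Hypothetical division condition: estimated error above tolerance.
        if math.sqrt(var) > tol and depth < max_depth:
            sub_tol = tol / math.sqrt(2 ** len(lo))  # errors add in quadrature
            parts = [integrate(f, s_lo, s_hi, sub_tol, rng, n, depth + 1, max_depth)
                     for s_lo, s_hi in split(lo, hi)]
            return sum(p[0] for p in parts), sum(p[1] for p in parts)
    # step 3: 2nd random sampling gives the contribution of this hypercube
    return estimate(f, lo, hi, random_points(lo, hi, n, rng))

value, var = integrate(lambda x: x[0] + x[1], [0.0, 0.0], [1.0, 1.0],
                       1e-3, random.Random(1))
```

For the test integrand x0 + x1 on the unit square the exact integral is 1, and the sketch divides the region down to `max_depth` before the final random sampling is used.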

For an integrand with singularities, the number of these repetitions
grows rapidly and the calculation time becomes very long. To reduce
the calculation time, the vectorized version of DICE (DICE 1.3vh [2])
was developed in 1998 for vector machines. The vector program code
introduced the concept of workers and a queuing mechanism. This
vectorized DICE succeeded in reducing the calculation time of the
integration even when the integrand has strong singularities.

Today, however, vector-processor machines have declined and
parallel-processor machines have become common in the field of High
Energy Physics. Moreover, cost-effective PC clusters running Linux,
with distributed or shared memory, are widespread. Thanks to this
rapid rise of PC clusters with parallel libraries such as MPI [3] or
with OpenMP [4], parallel programming has become very familiar to us.



2. Parallelization

2.1. Profile of DICE

To get good efficiency in the parallelization, it is important to know
which routines are time-consuming. The UNIX command gprof is a useful
tool for this. Table 1 shows an example of gprof output for an
integration by the non-parallelized DICE. This calculation was done on
an Alpha 21264 processor (700 MHz clock speed) machine running Linux,
and the compiler used was Compaq Fortran. In Table 1 the most
time-consuming routine is elwks, which is called in func. The
integrand function is given in elwks and func. The subroutine func is
called repeatedly in regular, random1 and random2 to evaluate the
integrand. Here, vbrndm is a routine that generates random numbers and
is called in both random1 and random2.

In summary, distributing the calculations in random1 and random2 among
workers (processors) is expected to reduce the calculation time
efficiently.

2.2. Algorithm

For an integrand with strong singularities, the region is divided into
a large number of hypercubes, and a large total number of random
numbers is required to obtain the integral with the requested error.
Therefore, there can be two ways of parallelization: distributing
random numbers and distributing hypercubes to workers.

As the 1st step we started the parallelization of DICE with the former
way, distributing random numbers. A schematic view of the algorithm
is shown in Fig. 1. It shows how random numbers are distributed among
workers in random1 and random2. The merit of this approach is not only
that the algorithm is very simple, as shown in Fig. 1, but also that
the overhead due to data transfer or load imbalance among workers is
small. The efficiency of this parallelization was very good. The
result of the efficiency measurement for this way of parallelization
was presented at ACAT2002 in Moscow [5].
Figure 1. Schematic view of the algorithm of parallelized DICE
distributing the random numbers among workers, for example w1, w2, w3
and w4. Each worker is responsible for a part of the random numbers in
the routines random1 and random2. The total number of random numbers
is the sum of the random numbers treated by each worker,
N_total = n1 + n2 + n3 + n4.
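
The bookkeeping behind N_total = n1 + n2 + n3 + n4 can be illustrated with a short Python sketch. It is a hedged illustration only: the workers are simulated in a single process rather than with MPI, the even split and the per-worker seeds are assumptions, and the function names `split_counts`, `worker_partial_sums` and `combine` are hypothetical. Seeding separate generators with different integers is a simplification of truly independent random streams.

```python
import random

def split_counts(n_total, n_workers):
    """Split n_total sample points as evenly as possible among workers."""
    base, extra = divmod(n_total, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]

def worker_partial_sums(f, n, seed):
    """One worker: evaluate f at n random points in [0, 1)."""
    rng = random.Random(seed)  # assumed: one random stream per worker
    s1 = s2 = 0.0
    for _ in range(n):
        y = f(rng.random())
        s1 += y
        s2 += y * y
    return n, s1, s2

def combine(partials):
    """Root: merge partial sums into the mean and its standard error."""
    n = sum(p[0] for p in partials)
    s1 = sum(p[1] for p in partials)
    s2 = sum(p[2] for p in partials)
    mean = s1 / n
    sample_var = max((s2 - n * mean * mean) / (n - 1), 0.0)
    return mean, (sample_var / n) ** 0.5

counts = split_counts(1000, 4)  # n1, n2, n3, n4
partials = [worker_partial_sums(lambda x: x * x, n, seed)
            for seed, n in enumerate(counts, start=1)]
mean, err = combine(partials)   # estimate of the integral of x^2 on [0, 1]
```

Only three numbers per worker (count, sum, sum of squares) have to be sent to the root, which is consistent with the small data-transfer overhead mentioned for this approach.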



As the 2nd step, in this paper we present the parallelization with the
latter way, distributing hypercubes among workers. As the 3rd and
final step, we plan to combine both ways.

3. Implementation 

In this parallelization, hypercubes are distributed among workers.
After the evaluation, the results are gathered to the root process
(for example, worker 1). The root process then scatters the results to
all workers. Fig. 2 shows schematically how the calculations are
distributed among workers.
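
The distribute, gather and send-back cycle described above can be sketched in Python as follows. This is a minimal single-process simulation, not the actual Fortran/MPI code: the round-robin assignment is an assumption (the text does not say how hypercubes are assigned), the midpoint rule stands in for the sampling inside a hypercube, and the names `assign_hypercubes`, `evaluate_hypercube` and `run` are hypothetical.

```python
def assign_hypercubes(n_cubes, n_workers):
    """Assumed round-robin: worker w gets cubes w, w + n_workers, ..."""
    return [list(range(w, n_cubes, n_workers)) for w in range(n_workers)]

def evaluate_hypercube(cube, n_cubes, f):
    """Stand-in for the sampling: midpoint rule on cell `cube` of [0, 1]."""
    width = 1.0 / n_cubes
    return f((cube + 0.5) * width) * width

def run(n_cubes, n_workers, f):
    assignment = assign_hypercubes(n_cubes, n_workers)
    # Each worker evaluates only its own hypercubes.
    local = [{c: evaluate_hypercube(c, n_cubes, f) for c in cubes}
             for cubes in assignment]
    # Gather: the root (worker 0 here) merges all local results.
    gathered = {}
    for part in local:
        gathered.update(part)
    # Send back: every worker receives a copy of the full result set.
    views = [dict(gathered) for _ in range(n_workers)]
    return assignment, views

assignment, views = run(8, 4, lambda x: x * x)
total = sum(views[0].values())  # integral of x^2 on [0, 1], midpoint rule
```

In an MPI program the two communication steps might map onto collective calls such as MPI_Gather and MPI_Bcast, although the text does not state which calls the parallelized DICE uses. With a static assignment like this one, workers whose hypercubes need further division do more work than the others, which is the load imbalance that a balancing mechanism would address.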
Table 1
gprof output: flat profile of non-parallelized DICE

Time     Cumulative   Self         Calls   Self        Total       Name of
[%]      time [sec]   time [sec]           [ms/call]   [ms/call]   routines
82.95    7.60         7.60         26214   0.29        0.29        elwks_
12.41    8.73         1.14         26214   0.04        0.33        func_
 2.52    8.96         0.23          3072   0.08        0.08        vbrndm_
 0.93    9.05         0.08          1536   0.06        2.80        randm2_
 0.92    9.13         0.08          1536   0.05        2.79        randm1_
 0.13    9.14         0.01          1638   0.01        0.34        regular_

This integration was done on the Alpha 21264/700MHz machine with Compaq Fortran
for Linux. The total CPU time required was 9.16 sec. The integration was done with an expected error
of 10%.
Figure 2. Schematic view of the algorithm of parallelized DICE. Each
worker is responsible for a part of the hypercubes.



4. Efficiency Measurement 

4.1. Measurement Environment 

The measurement environment is shown in Table 2. The MPI bandwidth was
measured to be 56.79 MB/s, using a simple ping-pong program based on
the MPI send and receive functions. The size of the transferred data
is 1 MB and the figure is an average over 10 measurements. In all
measurements we used an inexpensive Gigabit Ethernet switch, and it
can be said that the switch showed reasonable performance.
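
How a bandwidth figure follows from ping-pong timings can be illustrated with a short Python sketch. It shows only the arithmetic, not the measurement program used here. The assumptions are that 1 MB means 10^6 bytes, that each timing is a full round trip so the message crosses the link twice, and that the average is taken over the per-measurement rates; the function names and the timing values are hypothetical.

```python
def pingpong_bandwidth(message_bytes, round_trip_seconds):
    """Bandwidth in MB/s from one ping-pong round trip."""
    one_way = round_trip_seconds / 2.0  # the message travels there and back
    return message_bytes / one_way / 1.0e6

def average_bandwidth(message_bytes, round_trips):
    """Average the bandwidth over several measurements."""
    rates = [pingpong_bandwidth(message_bytes, t) for t in round_trips]
    return sum(rates) / len(rates)

# Hypothetical timings for illustration only (seconds per round trip).
example_round_trips = [0.040, 0.040, 0.050, 0.050]
example_rate = average_bandwidth(1_000_000, example_round_trips)
```

Under these assumptions, the measured 56.79 MB/s corresponds to a one-way transfer time of roughly 18 ms for a 1 MB message.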



Table 2
Measurement Environment

PC cluster
CPU              Xeon dual 3.06GHz
# of systems     8 systems
Memory           2 [GB]
Switch           10/100/1000 switch
Compiler         /usr/local/mpich-intel81/bin/mpif77
MPI Bandwidth    56.79 [MB/s] on average



In our implementation we use a Fortran compiler, since DICE is written
in Fortran, and we chose MPI [3] as the parallel library.



4.2. Example Physics process: e⁺e⁻ → μ⁺μ⁻γ

We choose the radiative muon pair production 
as an example physics process to measure the effi- 
ciency of the parallelization. This physics process 
Table 3
Summary of the parameters in the measurements

Physics process                          e⁺e⁻ → μ⁺μ⁻γ
E_CM                                     70 [GeV]
k_c (photon energy cutoff)               100 [MeV]
kinematics                               naive kinematics
# of dimensions                          4
# of random numbers in each hypercube    100
Max. # of workers                        8



is a realistic example and a good test case, since there are several
singularities in the calculation of the cross section: the mass
singularities of the initial electron and positron, those of the final
muons, the s-channel singularity caused by a nearly on-shell s-channel
photon after hard photon emission from the initial electron or
positron, and the infrared divergence of the real photon, which is
regularized by introducing a cutoff k_c on the photon energy.

Table 3 summarizes the parameters of the measurements. We used the
naive kinematics, that is, kinematics without a specially chosen set
of integration variables, so several singularities still exist in the
integrand. The details of the naive kinematics are given in Ref. [2].
The kinematics of this example process has of course been studied
well, and we should add that a program code exists with a
well-selected set of variables that avoids the strong singularities
[6].

4.3. The Wall-Clock Time Measurement 

Roughly speaking, in a parallel calculation the CPU time in each
worker should decrease to 1/2, 1/4, and 1/8 as the number of workers
increases to 2, 4, and 8. This is the basic check of whether the
parallel code runs well. In practice, the reduction rate of the
wall-clock time is a more important measure than the reduction rate of
the CPU time for judging how efficient the parallel code is.

Tables 4 and 5 show the measured wall-clock times for the calculations
of the cross section with requested errors of 2% and 1%, respectively.
In both tables the wall-clock time becomes shorter as workers are
added, and the reduction rates behave in the same manner. However, the
reduction rate is not good enough when the number of workers
increases: with 8 workers the wall-clock time is still 0.54 (2%) and
0.52 (1%) of the single-worker time, far from the ideal 1/8.
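
The reduction rates quoted in the tables can be reproduced, and converted into the more common speedup and parallel efficiency, with a few lines of Python. The sketch below uses the wall-clock times of Table 4; the definitions speedup = T(1)/T(n) and efficiency = speedup/n are the standard ones and are added here for illustration, and the function name `scaling` is hypothetical.

```python
def scaling(times):
    """times: {workers: seconds} -> {workers: (reduction, speedup, efficiency)}."""
    t1 = times[1]
    return {n: (t / t1, t1 / t, t1 / t / n) for n, t in sorted(times.items())}

# Wall-clock times [sec] from Table 4 (error = 1.97%).
wall_clock = {1: 8894, 2: 6178, 4: 5011, 8: 4784}

for n, (reduction, speedup, efficiency) in scaling(wall_clock).items():
    print(f"{n} workers: reduction {reduction:.2f}, "
          f"speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

With 8 workers the speedup is about 1.86, that is, a parallel efficiency of about 23%, which quantifies why the reduction rate is judged insufficient.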

5. Summary and Outlook 

We presented recent developments in the parallelization of DICE. The
ongoing work is the 2nd stage of our parallelization activities. The
efficiency of the current parallel code was evaluated for the example
physics process e⁺e⁻ → μ⁺μ⁻γ with naive kinematics. For this process,
the wall-clock time was indeed reduced with the current parallel code,
but the reduction rate is not satisfactory when the number of workers
increases. So some work still remains to optimize the current parallel
code further.

Our current code is based on the vectorized DICE, DICE 1.3vh, which
does not include a load-balancing mechanism between workers. We
believe that a further reduction of the wall-clock time will be
possible by applying a load-balancing mechanism to our current code.

The main goal of all these efforts is the parallelization using the
combination of both distribution ways. After including the
load-balancing mechanism we will be able to enter the 3rd stage of the
parallelization.
Table 4
Efficiency of the parallelized DICE: the calculation of the cross section with error = 1.97%.
σ = (2.5824 ± 0.0508) × 10^[illegible] nb

# of Processors   CPU time    wall-clock time   Reduction rate:   Reduction rate:
(Workers)         [sec]       [sec]             CPU time          wall-clock time
1                 8882.03     8894              1.00              1.00
2                 5179.62     6178              0.58              0.69
4                 3308.92     5011              0.37              0.56
8                 2394.34     4784              0.27              0.54

On the Xeon 3.06 GHz, the non-parallelized DICE required 8704.84 sec of CPU time.




Table 5
Efficiency of the parallelized DICE: the calculation of the cross section with error = 0.91%.
σ = (2.8307 ± 0.0259) × 10^[illegible] nb

# of Processors   CPU time     wall-clock time   Reduction rate:   Reduction rate:
(Workers)         [sec]        [sec]             CPU time          wall-clock time
1                 186682.83    197444            1.00              1.00
2                 109401.92    134377            0.59              0.68
4                  69676.01    108234            0.37              0.55
8                  51056.55    103126            0.27              0.52

On the Xeon 3.06 GHz, the non-parallelized DICE required 183884.34 sec of CPU time.